The following machine learning project focuses on…
1 Introduction
1.1 Overview and Motivation
1.1.1 Context and Background
The Swiss real estate market, characterized by its resilience and complexity, presents a significant opportunity for advanced analytical approaches to understand pricing dynamics. This project, undertaken as part of a Master’s degree in Machine Learning at the University of Lausanne, aims to harness the power of data science to predict real estate market prices in Switzerland. Utilizing contemporary machine learning techniques within this academic framework not only enhances the learning experience but also contributes to a practical understanding of real estate valuation.
As housing prices continue to fluctuate amid economic uncertainties, such as interest rate changes and demographic shifts (Credit Suisse), this investigation is not only timely but also of significant importance to potential investors, policymakers, and the academic community.
1.1.2 Aim Of The Investigation
The primary objective of this study is to predict Swiss real estate market prices using real-time data scraped from ImmoScout24, a prominent Swiss real estate website. This study addresses two closely related questions:
How can machine learning models utilize real-time data scraped from online real estate platforms to predict price trends in the Swiss real estate market?
How can machine learning models predict the sale prices of real estate properties in Switzerland based on current market data?
The relevance of this investigation is underpinned by the substantial financial implications of real estate investments and by the value of predictive insights for both buyers and sellers in a market where real estate plays a pivotal role in financial stability and growth.
1.1.3 Description of the Data
The data for this study consists of a meticulously compiled dataset from ImmoScout24, featuring a wide array of variables related to property listings across Switzerland. Fields in the dataset include price, number of rooms, square meters, address, canton, property type, floor, and year of construction. These data points have been gathered through a robust scraping algorithm designed to collect a comprehensive snapshot of the current market. This dataset provides a granular view of the market, essential for training robust machine learning models.
1.1.4 Methodology
This project employs model-based machine learning techniques to quantify the impact of various factors on property prices in Switzerland. The methodology involves training predictive models to interpret the complex relationships within the data, providing a statistical basis for price prediction. This approach allows for an examination of both linear dependencies and more intricate interactions, such as how location and property type combine to influence pricing.
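As an illustration of this model-based approach, the sketch below shows how a linear baseline can capture both a direct dependency (surface area) and an interaction between location and property type. This is an illustrative example only, not the project's actual model: the column names mirror the dataset fields, and the toy data are invented for demonstration.

```r
# Illustrative sketch only (hypothetical toy data, not the project's model).
toy <- data.frame(
  price = c(800, 950, 1200, 600, 1400, 700, 900, 1100, 1300, 650, 1250, 720) * 1000,
  square_meters = c(90, 110, 150, 70, 180, 85, 95, 130, 160, 75, 140, 88),
  canton = rep(c("Vaud", "Zurich", "Geneva"), each = 4),
  property_type = rep(c("Apartment", "Villa"), times = 6)
)

# The canton:property_type interaction lets location and property type
# combine to influence the predicted price, beyond their separate effects.
fit <- lm(price ~ square_meters + canton * property_type, data = toy)

# Coefficients: intercept, square_meters, 2 canton dummies,
# 1 property_type dummy, 2 interaction terms -> 7 in total
length(coef(fit))
```

More flexible learners considered later in the report relax the linearity assumption, but the interaction term above conveys the core idea of combined effects.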
1.1.5 Structure of the Report
The report is structured as follows to provide a coherent narrative and logical flow of analysis:
Section 1: Introduction - Outlines the research context, objectives, and significance.
Section 2: Data - Details the sources, nature, and preprocessing of the data used.
Section 3: Exploratory Data Analysis (EDA) - Analyzes the data to uncover patterns and anomalies.
Section 4: Unsupervised Learning - Applies clustering to derive municipality-level features.
Section 5: Supervised Learning - Discusses the development and validation of predictive models.
Section 6: Conclusion - Summarizes the findings, discusses the implications, and suggests areas for further research.
2 Data
2.1 Properties Dataset
The dataset is stored in CSV format and comprises real estate listings scraped from the ImmoScout24 platform.
The instances in the dataset represent individual property listings. Each row corresponds to a unique listing on ImmoScout24.
Each listing is composed of the following fields:
# Create a data frame with the dataset's feature descriptions
property_info <- data.frame(
  Feature = c("Price (CHF)", "Number of Rooms", "Square Meters (m²)", "Address",
              "Canton", "Property Type", "Floor", "Year of Construction"),
  Description = c(
    "Numerical - Ranges from 25,000 to 26,149,500 CHF.",
    "Numerical - Values from 1 to 27 rooms, representing the total room count excluding bathrooms and kitchen.",
    "Numerical - Property sizes range from 1 to 2,700 m².",
    "Text - Specific address of the property.",
    "Categorical - Indicates the Swiss canton where the property is located.",
    "Categorical - Includes types such as Apartment, Single House, Villa, etc.",
    "Text/Numerical - Specifies whether the property is on the ground floor.",
    "Categorical - Grouped into categories such as 0-1919, 1920-1945, ..., up to 2024."
  )
)

# Render the table with kableExtra styling
property_table <- kable(property_info, format = "html") %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed"))
property_table
| Feature | Description |
|---|---|
| Price (CHF) | Numerical - Ranges from 25,000 to 26,149,500 CHF. |
| Number of Rooms | Numerical - Values from 1 to 27 rooms, representing the total room count excluding bathrooms and kitchen. |
| Square Meters (m²) | Numerical - Property sizes range from 1 to 2,700 m². |
| Address | Text - Specific address of the property. |
| Canton | Categorical - Indicates the Swiss canton where the property is located. |
| Property Type | Categorical - Includes types such as Apartment, Single House, Villa, etc. |
| Floor | Text/Numerical - Specifies whether the property is on the ground floor. |
| Year of Construction | Categorical - Grouped into categories such as 0-1919, 1920-1945, ..., up to 2024. |
The data wrangling and cleaning process undertaken for our dataset involves a series of methodical steps aimed at refining the data for accurate analysis and modeling. Initially, the script identifies and addresses problematic values in the number_of_rooms field, where non-numeric entries are detected and marked for further cleaning. The price field is then sanitized to remove any non-numeric characters, ensuring all data in this column represents valid numerical values. A filter is applied to exclude properties priced under 20,000 CHF to avoid anomalies that could skew analysis, likely due to data entry errors or unrealistic listing prices.
The 20,000 CHF threshold was chosen conservatively: genuine sale listings in Switzerland virtually never fall below this value, so cheaper entries almost certainly correspond to data entry errors or misclassified listings (for example, monthly rents entered as sale prices). No aggregation is applied at this stage; rows are filtered individually.
Further cleaning standardizes addresses by stripping unwanted characters at the beginning, improving uniformity across this variable. Rows containing any NA values are then removed, so the analysis proceeds on complete cases. Special attention is given to the square_meters field: non-digit characters are removed, the cleansed string is converted to numeric values, and any NA entries that emerge from this transformation are discarded.
Rows where square_meters becomes NA after this conversion are dropped because the listing simply did not report a living area; since surface is among the strongest price predictors, imputing it could distort the models, so complete cases are preferred.
Categorical data in year_category is truncated to maintain consistency and converted into a factor for potential use in modeling. In dealing with number_of_rooms, non-relevant text (like "rooms" or "room") is removed, and checks discard erroneous data such as entries mistakenly containing "m²" or room counts exceeding 100, likely due to data entry oversights. An additional corrective measure divides the number of rooms by ten for entries above 27, assuming these were misreported due to decimal placement errors.
Lastly, the script standardizes the canton field by capitalizing each entry, which is important for categorical consistency across the dataset. Together, these steps ensure the dataset's integrity and readiness for the analytical processes that follow.
# Identify values causing the issue
problematic_values <- properties$number_of_rooms[is.na(as.numeric(properties$number_of_rooms))]
#> Warning: NAs introduced by coercion

# Remove non-numeric characters from price and convert to numeric
properties$price <- as.numeric(gsub("[^0-9]", "", properties$price))

# Exclude rows with price < 20,000 CHF
properties <- properties[properties$price >= 20000, ]

# Strip incomplete address prefixes
properties$address <- gsub("^\\W*[.,0-]\\W*", "", properties$address)

# Keep complete cases only
properties_filtered <- na.omit(properties)

# Truncate year_category to its first 9 characters and convert to factor
properties_filtered$year_category <- substr(properties_filtered$year_category, 1, 9)
properties_filtered$year_category <- as.factor(properties_filtered$year_category)

# Remove "m²" and other non-digits from square_meters, then convert to numeric
properties_filtered$square_meters <- as.numeric(gsub("\\D", "", properties_filtered$square_meters))

# Count the NA observations left in square_meters
print(sum(is.na(properties_filtered$square_meters)))
#> [1] 1056

# Drop rows where square_meters is NA
properties_filtered <- properties_filtered[!is.na(properties_filtered$square_meters), ]

# Capitalize canton names
properties_filtered$canton <- tools::toTitleCase(properties_filtered$canton)

# Clean the number_of_rooms column: strip text labels
properties_filtered$number_of_rooms <- gsub(" rooms", "", properties_filtered$number_of_rooms)
properties_filtered$number_of_rooms <- gsub(" room", "", properties_filtered$number_of_rooms)

# Remove rows with "m²" mistakenly present in number_of_rooms, then convert to numeric
properties_filtered <- properties_filtered[!grepl("m²", properties_filtered$number_of_rooms), ]
properties_filtered$number_of_rooms <- as.numeric(properties_filtered$number_of_rooms)

# Remove rows with more than 100 rooms (data entry errors)
properties_filtered <- properties_filtered[properties_filtered$number_of_rooms <= 100, , drop = FALSE]

# Divide by 10 when the room count exceeds 27 (assumed decimal placement errors)
properties_filtered$number_of_rooms <- ifelse(properties_filtered$number_of_rooms > 27,
                                              properties_filtered$number_of_rooms / 10,
                                              properties_filtered$number_of_rooms)

# Show the first 100 rows of the cleaned dataset
reactable(head(properties_filtered, 100))
Here we can take a glance at the summary statistics.

2.1.3 Locality Dataset (AMTOVZ_CSV_LV95)
The AMTOVZ_CSV_LV95 Data, provided by the Swiss Federal Office of Topography (swisstopo), is an “Official Index of Localities” that encompasses comprehensive geospatial and administrative details of all localities in Switzerland and the Principality of Liechtenstein. It includes critical data such as locality names, postal codes, and geographic perimeters.
This dataset, in CSV format, is updated monthly, incorporating changes reported by cantonal authorities and Swiss Post. It serves various purposes: spatial analysis, integration with other geographic datasets, use as a geolocated background in GIS (Geographic Information Systems) and CAD (Computer-Aided Design) systems, statistical analysis, and reference data for information systems.
Updates and release notes for the dataset are provided periodically, detailing changes and improvements made over time. The Swiss Federal Office of Topography manages and distributes this dataset as part of its responsibilities in collecting and providing official geospatial data for Switzerland.
# Describe the locality dataset's features
location_info <- data.frame(
  Feature = c("Names (Gemeindename)", "Postal Codes (PLZ)", "Canton Codes (Kantonskürzel)",
              "Longitude (E)", "Latitude (N)"),
  Description = c(
    "Text - Names of the localities.",
    "Numerical - Postal code of each locality.",
    "Categorical - Short code representing each canton.",
    "Numerical - Geographic coordinate specifying the east-west position.",
    "Numerical - Geographic coordinate specifying the north-south position."
  )
)

# Render the table with kableExtra styling
location_table <- kable(location_info, format = "html") %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed"))
location_table
| Feature | Description |
|---|---|
| Names (Gemeindename) | Text - Names of the localities. |
| Postal Codes (PLZ) | Numerical - Postal code of each locality. |
| Canton Codes (Kantonskürzel) | Categorical - Short code representing each canton. |
| Longitude (E) | Numerical - Geographic coordinate specifying the east-west position. |
| Latitude (N) | Numerical - Geographic coordinate specifying the north-south position. |
The instances in the dataset represent individual localities within Switzerland and the Principality of Liechtenstein.
2.1.3.1 Creating Variable zip_code and merging with AMTOVZ_CSV_LV95
A new zip_code variable is created by extracting numeric values from the address column of the filtered properties dataset (properties_filtered). This is done with a regular expression that removes non-digit characters, assuming the zip code is embedded in the address string. The resulting zip codes are then cleaned so that they are four-digit numbers, keeping only the last four digits when the extracted value is longer.
df <- properties_filtered

# The address column looks like "1844 Villeneuve VD" and contains the zip code.
# Extract the digits to create a new zip_code column.
df$zip_code <- as.numeric(gsub("\\D", "", df$address))

# Keep only the last four digits when the extracted number is longer than four digits
df$zip_code <- ifelse(df$zip_code > 9999, df$zip_code %% 10000, df$zip_code)
The properties_filtered dataset is merged with the prepared AMTOVZ_CSV_LV95 dataframe based on the zip_code, aiming to enrich property listings with accurate locality names, canton information, and geographic coordinates.
# Read the AMTOVZ_CSV_LV95 index (WGS84 coordinates)
amto <- read.csv(file.path(here(), "data/AMTOVZ_CSV_WGS84.csv"), sep = ";")

# Keep locality name, zip code, canton code and coordinates
amto_df <- amto[, c("Gemeindename", "PLZ", "Kantonskürzel", "E", "N")]
colnames(amto_df) <- c("Community", "zip_code", "Canton_code", "lon", "lat")

# Remove duplicated zip codes before merging
amto_df <- amto_df[!duplicated(amto_df$zip_code), ]

# Left join: enrich each listing with locality data when the zip code matches
df <- merge(df, amto_df, by = "zip_code", all.x = TRUE)

# Inspect the listings whose zip code found no match (Community is NA)
df[is.na(df$Community), ]
#> (output truncated: 77 listings with no matching Community; their Canton_code,
#>  lon and lat are NA, e.g. extracted zip codes such as 322, 1014, 1200, 2500,
#>  3000, 4000, 6602 and 8000)
After merging the property listings with the AMTOVZ_CSV_LV95 dataset, we identified and removed 77 instances where the Community field was NA.
Reasons for the NA occurrences:
Zip code not found in the AMTOVZ_CSV_LV95 dataset. This occurs when the property's zip code is absent from the index, possibly due to outdated or incorrect listings.
Incorrectly isolated zip code from the address. Errors in extracting zip codes, often due to irregular address formatting.
Removing these entries preserves the dataset's accuracy and reliability for further analysis, which is crucial in real estate market analysis where geographical accuracy is paramount.
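The second failure mode can be reproduced directly: when an address contains a house number, stripping all non-digits concatenates it with the zip code, and keeping only the last four digits then yields a wrong code. A minimal sketch, using one of the unmatched addresses from the merge output above:

```r
# Sketch of the zip-code extraction failure mode on an address that
# contains a house number ("30d") and a trailing suffix ("2").
address <- "Via Cuolm Liung 30d, 7032 Laax GR 2"

digits <- gsub("\\D", "", address)   # "3070322": house number, zip, suffix fused
zip <- as.numeric(digits) %% 10000   # keeps the last four digits: 322, not 7032

print(zip)
#> [1] 322
```

A more robust extraction would match a standalone four-digit token instead, e.g. `regmatches(address, regexpr("\\b[1-9][0-9]{3}\\b", address))`, though some malformed addresses would still need manual handling.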
# Remove the rows with NA in Community, then preview the result
properties_filtered <- df[!is.na(df$Community), ]
reactable(head(properties_filtered, 100))
2.1.4 Tax Rates Dataset

The cantonal tax coefficients are read from estv_income_rates.csv and reshaped: header labels spread over two rows are combined, church tax columns are dropped, and the remaining coefficients are converted to numeric values.

# Read the tax rates CSV
impots <- read.csv(file.path(here(), "data/estv_income_rates.csv"),
                   sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Remove the first row and the third column
impots <- impots[-1, ]
impots <- impots[, -3]

# Label the grouped header columns
impots[1, 4:8]   <- "Impôt sur le revenu"
impots[1, 9:13]  <- "Impôt sur la fortune"
impots[1, 14:16] <- "Impôt sur le bénéfice"
impots[1, 17:19] <- "Impôt sur le capital"

# Combine the content of the first two rows into the second row
impots[2, ] <- apply(impots[1:2, ], 2, function(x) {
  paste(ifelse(is.na(x[1]), x[2],
               ifelse(is.na(x[2]), x[1], paste(x[1], x[2], sep = " "))))
})
impots <- impots[-1, ]

# Name the first cell and promote the first row to column names
impots[1, 1] <- "Coefficient d'impôt en %"
colnames(impots) <- impots[1, ]
impots <- impots[-1, ]

# Check for missing values
any_missing <- any(is.na(impots))
if (any_missing) {
  print("There are missing values in impots.")
} else {
  print("There are no missing values in impots.")
}
#> [1] "There are no missing values in impots."

# Use the third column as row names, then drop it and the redundant canton column
row.names(impots) <- impots[, 3]
impots <- impots[, -3]
impots <- impots[, -2]

# Drop the church tax columns
impots <- impots[, -c(4:6)]
impots <- impots[, -c(6:8)]
impots <- impots[, -8]
impots <- impots[, -10]

# Clean values and convert to numeric; replace NA with 0
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))
cleaned_impots[is.na(cleaned_impots)] <- 0

# Sanity check for non-numeric values
non_numeric <- sum(!is.na(cleaned_impots) & !is.numeric(cleaned_impots))
if (non_numeric > 0) {
  print(paste("Warning: Found", non_numeric, "non-numeric values."))
}
rownames(cleaned_impots) <- rownames(impots)
2.1.5 Commune Data
2.1.5.1 Cleaning
The cleaning replaces NAs in both the social assistance rate (Taux d'aide sociale) and the political variables (Conseil national data).
For the social assistance rate, NAs carried the status flag "Q" ("not indicated to protect confidentiality"); we replaced them with the Swiss average for 2019, which was 3.2%.
For the political data, NAs carried the status flag "M" ("not indicated because the data was not important or applicable"); we therefore replaced them with 0.
# Read the commune indicators (path may need adjusting)
commune_prep <- read.csv(file.path(here(), "data/commune_data.csv"),
                         sep = ";", header = TRUE, stringsAsFactors = FALSE)

# Keep only 2019 as the reference year (2020 is apparently incomplete)
commune_2019 <- subset(commune_prep, PERIOD_REF == "2019") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE", "STATUS"))

# Drop rows where STATUS is "Q" or "M" (no value), then drop the STATUS column
commune_2019 <- subset(commune_2019, STATUS == "A") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE"))

# Remove aggregate rows
commune_2019 <- subset(commune_2019, REGION != "Schweiz")

# One column per indicator
commune_2019 <- commune_2019 %>%
  pivot_wider(names_from = INDICATORS, values_from = VALUE)

# Rename columns using the provided mapping
commune <- commune_2019 %>%
  rename(`Population - Habitants` = Ind_01_01,
         `Population - Densité de la population` = Ind_01_03,
         `Population - Etrangers` = Ind_01_08,
         `Population - Part du groupe d'âge 0-19 ans` = Ind_01_04,
         `Population - Part du groupe d'âge 20-64 ans` = Ind_01_05,
         `Population - Part du groupe d'âge 65+ ans` = Ind_01_06,
         `Population - Taux brut de nuptialité` = Ind_01_09,
         `Population - Taux brut de divortialité` = Ind_01_10,
         `Population - Taux brut de natalité` = Ind_01_11,
         `Population - Taux brut de mortalité` = Ind_01_12,
         `Population - Ménages privés` = Ind_01_13,
         `Population - Taille moyenne des ménages` = Ind_01_14,
         `Sécurité sociale - Taux d'aide sociale` = Ind_11_01,
         `Conseil national - PLR` = Ind_14_01,
         `Conseil national - PDC` = Ind_14_02,
         `Conseil national - PS` = Ind_14_03,
         `Conseil national - UDC` = Ind_14_04,
         `Conseil national - PEV/PCS` = Ind_14_05,
         `Conseil national - PVL` = Ind_14_06,
         `Conseil national - PBD` = Ind_14_07,
         `Conseil national - PST/Sol.` = Ind_14_08,
         `Conseil national - PES` = Ind_14_09,
         `Conseil national - Petits partis de droite` = Ind_14_10)

# If no one voted for a party the value is NA -> replace with 0
commune <- commune %>%
  mutate_at(vars(starts_with("Conseil national")), ~replace_na(., 0))

# Replace NAs in the social assistance rate with the 2019 Swiss average (3.2%)
mean_taux_aide_social <- 3.2
commune <- commune %>%
  mutate(`Sécurité sociale - Taux d'aide sociale` =
           if_else(is.na(`Sécurité sociale - Taux d'aide sociale`),
                   mean_taux_aide_social,
                   `Sécurité sociale - Taux d'aide sociale`))

# Show the first 100 rows of commune
reactable(head(commune, 100))
3 Unsupervised learning
Clustering and/or dimension reduction
To explore the relationship between real estate prices and external factors, we performed three unsupervised clustering analyses on fiscal, demographic, and political data sets covering each Swiss municipality. The resulting cluster labels are then included as features in our supervised models, since municipalities within a given cluster follow roughly the same behaviour.
3.1 Fiscal clustering
First, we performed k-means clustering on the fiscal data set. The elbow method, based on the within-cluster sum of squares, suggested 5 clusters.
Click to show code
# Clean data and convert to numeric
set.seed(123)
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))
cleaned_impots[is.na(cleaned_impots)] <- 0  # Replace NA values with 0

# Scale the features
scaled_impots <- scale(cleaned_impots)

# Perform k-means clustering
k <- 2  # Initial guess for the number of clusters
kmeans_model <- kmeans(scaled_impots, centers = k)

# Check within-cluster sum of squares (elbow method)
wss <- numeric(10)
for (i in 1:10) {
  kmeans_model <- kmeans(scaled_impots, centers = i)
  wss[i] <- sum(kmeans_model$withinss)
}
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# Adjust k based on the elbow method
k <- 5

# Perform k-means clustering again with the optimal k
kmeans_model <- kmeans(scaled_impots, centers = k)

# Assign cluster labels
clusters <- kmeans_model$cluster

# Plot dendrogram
# colored_dend <- color_branches(dend, k = 5)
# y_zoom_range <- c(0, 80)  # Adjust the y-axis range as needed
# plot(colored_dend, main = "Hierarchical Clustering Dendrogram", horiz = FALSE, ylim = y_zoom_range)
Click to show code
# Get the cluster centers
cluster_centers <- kmeans_model$centers

# Create a data frame with cluster centers
cluster_centers_df <- data.frame(cluster = 1:k, cluster_centers)
# print(cluster_centers_df)

# Calculate the size of each cluster
cluster_sizes <- table(kmeans_model$cluster)
# print(cluster_sizes)

# Get the cluster labels
cluster_labels <- kmeans_model$cluster

# Convert cleaned_impots to a data frame
impots_cluster <- as.data.frame(cleaned_impots)

# Add the cluster labels to cleaned_impots
impots_cluster$cluster <- cluster_labels
rownames(impots_cluster) <- rownames(impots)
impots_cluster <- impots_cluster %>%
  rownames_to_column(var = "Community")
3.2 Demographic clustering
Then, we performed hierarchical clustering. The data was first scaled (some features are percentages, others raw counts); the dissimilarity matrix was then computed with the Minkowski distance, and Ward's method was used for the linkage.
As the optimal number of clusters for the fiscal data set was determined to be 5, we continued our analysis of the two other data sets with 5 clusters as well, so that the three categorical features resulting from the unsupervised clustering share the same scale.

Click to show code

# Demographic clustering
cols_commune_demographic <- select(commune, -c("REGION", "CODE_REGION",
  "Conseil national - PLR", "Conseil national - PDC", "Conseil national - PS",
  "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL",
  "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES",
  "Conseil national - Petits partis de droite"))

# Scale the columns: some are totals, some are percentages
cols_commune_demographic <- scale(cols_commune_demographic)

# Calculate the distance matrix
dist_matrix_demographic <- dist(cols_commune_demographic, method = "minkowski")

# Perform hierarchical clustering
hclust_model_demographic <- hclust(dist_matrix_demographic, method = "ward.D2")

# Create dendrogram
dend_demo <- as.dendrogram(hclust_model_demographic)
dend_demo <- color_branches(dend_demo, k = 5)  # 5 clusters, to keep the same scale for all our variables
par(mar = c(0.001, 4, 4, 2) + 0.1)
plot(dend_demo, main = "Demographics - Hierarchical Clustering Dendrogram", xlab = "")
3.3 Political clustering
The same process was used for our political data set, which represents the share of votes for each major party in the Conseil national. The only difference is that we did not scale the data, as all features are already percentages.

Click to show code

# Political clustering
cols_commune_politics <- select(commune, c("Conseil national - PLR", "Conseil national - PDC",
  "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS",
  "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.",
  "Conseil national - PES", "Conseil national - Petits partis de droite"))

# Calculate the distance matrix
dist_matrix_politics <- dist(cols_commune_politics, method = "minkowski")

# Perform hierarchical clustering
hclust_model_politics <- hclust(dist_matrix_politics, method = "ward.D2")

# Create dendrogram
dend_pol <- as.dendrogram(hclust_model_politics)
dend_pol <- color_branches(dend_pol, k = 5)  # 5 clusters, to keep the same scale for all our variables
par(mar = c(0.001, 4, 4, 2) + 0.1)
plot(dend_pol, main = "Politics - Hierarchical Clustering Dendrogram")
Click to show code
# Preparing df_commune for merging with the main dataset
df_commune <- select(commune, REGION)
df_commune$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)
df_commune$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Preparing to merge
merging <- inner_join(amto_df, df_commune, by = c("Community" = "REGION"))
impots_cluster_subset <- impots_cluster[, c("Community", "cluster")]
merging <- merging %>%
  left_join(impots_cluster_subset, by = "Community")
clusters_df <- merging %>%
  rename(Tax_cluster = cluster) %>%
  rename(Commune = Community)
clusters_df <- clusters_df %>%
  select(c("Commune", "zip_code", "Canton_code", "Demographic_cluster", "Political_cluster", "Tax_cluster"))

# The only NAs are for the commune Brugg (written "Brugg (AG)" in the other data set) -> set the cluster manually
clusters_df$Tax_cluster[is.na(clusters_df$Tax_cluster)] <- 2

# Adding it to our main data set:
properties_filtered <- merge(properties_filtered,
  clusters_df[, c("zip_code", "Demographic_cluster", "Political_cluster", "Tax_cluster")],
  by = "zip_code", all.x = TRUE)
Click to show code
# Dropping 228 rows containing NAs after the merge (name mismatches)
# Find rows with NA values in the specified columns
na_rows <- subset(properties_filtered,
  is.na(Demographic_cluster) | is.na(Political_cluster) | is.na(Tax_cluster))

# Drop the NA rows
properties_filtered <- anti_join(properties_filtered, na_rows, by = "zip_code")
Click to show code
# Interpretation of demographic clusters
demographic_vars <- select(commune, -c("REGION", "CODE_REGION",
  "Conseil national - PLR", "Conseil national - PDC", "Conseil national - PS",
  "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL",
  "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES",
  "Conseil national - Petits partis de droite"))

# Scale the variables
scaled_demographic_vars <- scale(demographic_vars)

# Convert to data frame
scaled_demographic_vars <- as.data.frame(scaled_demographic_vars)

# Add demographic cluster labels
scaled_demographic_vars$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)

# Melt the dataset to long format
melted_demographic <- melt(scaled_demographic_vars, id.vars = "Demographic_cluster")

# Define color palette
my_colors <- c("#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6")

# Calculate the number of boxplots
num_boxplots <- length(unique(melted_demographic$variable))

# Calculate the number of rows and columns for the layout
num_rows <- ceiling(sqrt(num_boxplots))
num_cols <- ceiling(num_boxplots / num_rows)

# Set up the layout with reduced margins
par(mfrow = c(num_rows, num_cols), mar = c(5, 5, 1, 1))

# Create boxplots for each variable
for (variable in unique(melted_demographic$variable)) {
  # Calculate 5%/95% quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_demographic$value[melted_demographic$variable == variable],
                      melted_demographic$Demographic_cluster[melted_demographic$variable == variable],
                      quantile, c(0.05, 0.95))

  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))

  # Create boxplot with custom colors
  boxplot(value ~ Demographic_cluster,
          data = melted_demographic[melted_demographic$variable == variable, ],
          main = variable, xlab = "", ylab = variable,
          ylim = c(ylim_min, ylim_max), col = my_colors)
}
The unsupervised clustering performed on the demographic data of Swiss municipalities returns some interesting results.
Our first cluster seems to group municipalities where many families with children live ("Part du groupe d'âge 0-19 ans" and "Taille moyenne des ménages" are both high).
Clusters 2 and 3 are very similar, with most variables showing no clear pattern. Note, however, that municipalities in cluster 2 have a slightly higher population density than those in cluster 3, with more foreign inhabitants.
Cluster 4 seems to correspond to municipalities of large cities (large, dense populations, with most inhabitants aged 20 to 64).
Cluster 5 seems to correspond to municipalities with an aging population ("Part du groupe d'âge 65+ ans" and "Taux brut de mortalité" both have high values).
Click to show code
# Subset the dataset to the variables used to create the political clusters, plus the cluster labels
political_vars <- select(commune, c("Conseil national - PLR", "Conseil national - PDC",
  "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS",
  "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.",
  "Conseil national - PES", "Conseil national - Petits partis de droite"))
colnames(political_vars) <- sub("Conseil national - ", "", colnames(cols_commune_politics))

# Add political cluster labels
political_vars$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Melt the dataset to long format
melted_political <- melt(political_vars, id.vars = "Political_cluster")

# Define pastel color palette
pastel_colors <- c("#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6")

# Set up the layout with reduced margins
par(mfrow = c(5, 2), mar = c(3, 3, 1, 1), pty = "s")

# Create boxplots for each variable
for (variable in unique(melted_political$variable)) {
  # Calculate 5%/95% quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_political$value[melted_political$variable == variable],
                      melted_political$Political_cluster[melted_political$variable == variable],
                      quantile, c(0.05, 0.95))

  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))

  # Create boxplot with pastel colors
  boxplot(value ~ Political_cluster,
          data = melted_political[melted_political$variable == variable, ],
          main = variable, xlab = "Political Cluster", ylab = variable,
          ylim = c(ylim_min, ylim_max), col = pastel_colors)
}
The political clusters are harder to interpret than the demographic ones. It is nevertheless interesting to note, for example, that cluster 3 has high values for the PS and PST (left-leaning parties) and low values for right-leaning parties, while cluster 2 shows the opposite pattern, with high values for right-leaning parties.
Click to show code
# Subset the dataset to the variables used to create the tax clusters, plus the cluster labels
tax_vars <- select(impots_cluster, -c("Community", "cluster"))

# Scale the variables
scaled_tax_vars <- scale(tax_vars)

# Convert to data frame
scaled_tax_vars <- as.data.frame(scaled_tax_vars)

# Add tax cluster labels
scaled_tax_vars$Tax_cluster <- impots_cluster$cluster

# Melt the dataset to long format
melted_tax <- melt(scaled_tax_vars, id.vars = "Tax_cluster")

my_colors <- c("#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6")

# Set up the layout with reduced margins
par(mfrow = c(3, 3), mar = c(5, 5, 2, 2))

# Create boxplots for each variable
for (variable in unique(melted_tax$variable)) {
  # Calculate 5%/95% quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_tax$value[melted_tax$variable == variable],
                      melted_tax$Tax_cluster[melted_tax$variable == variable],
                      quantile, c(0.05, 0.95))

  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))

  boxplot(value ~ Tax_cluster,
          data = melted_tax[melted_tax$variable == variable, ],
          main = variable, xlab = "", ylab = variable,
          ylim = c(ylim_min, ylim_max), col = my_colors)
}
The fiscal clusters are also quite difficult to interpret. The only striking one is cluster 5, with very high cantonal taxes but average communal (municipal) taxes.
4 EDA
4.1 Map representation of distribution of properties
Here we decided to represent the distribution of properties in Switzerland using a map. The map is interactive and allows users to hover over the markers to see the price. The markers are color-coded in orange and have a semi-transparent fill to reduce visual noise. The size of the markers is smaller to optimize the visual representation of the data.
This visualization helps in understanding the geographical spread and density of properties in the dataset.
Click to show code
# Create a leaflet map with optimized markers
map <- leaflet(properties_filtered) %>%
  addTiles() %>%  # Add default OpenStreetMap tiles
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>%  # Add topographic maps for context
  addCircleMarkers(~lon, ~lat,
    radius = 1.5,        # Smaller radius for the circle markers
    color = "#F97300",   # Marker color
    fillOpacity = 0.2,   # Semi-transparent fill
    stroke = FALSE,      # No border on the circle markers, to reduce visual noise
    popup = ~paste("Price: ", price, "<br>",
                   "Rooms: ", number_of_rooms, "<br>",
                   "Type: ", property_type, "<br>",
                   "Year: ", year_category),
    label = ~paste("Price: ", price)  # Tooltip on hover
  ) %>%
  addLegend(position = "bottomright",  # Position the legend at the bottom right
            colors = "#F97300",        # Use the same color as the markers
            labels = "Properties")     # Label for the legend

map$width <- "100%"  # Set the width of the map to 100%
map$height <- 600    # Set the height of the map to 600 pixels
map
The map shows that properties are well distributed across the country, with fewer listings in the Alpine region, as expected given its mountainous terrain. The dataset therefore provides good coverage across cantons and locations and can be used for further analysis.
4.2 Histogram of prices
Click to show code
histogram_price <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "red") +
  labs(title = "Distribution of Prices", x = "Price", y = "Frequency") +
  theme_minimal()

# Convert ggplot object to plotly object
interactive_histogram_price <- ggplotly(histogram_price, width = 600, height = 500)

# Display the interactive histogram
interactive_histogram_price
As we can see, the majority of the properties are priced below 5 million.
4.3 Prices between 0 and 5 million
To enhance readability, we focus on the bulk of the data, with prices between 0 and 5 million, and filter out outliers.
4.3.1 Histogram of prices for each property type
Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 5000000, by = 1000000)
labels <- paste0(breaks / 1000000, "Mio")

# Create the ggplot object
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ property_type, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Property Type", x = "Price", y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5000000))

# Convert ggplot object to plotly object
interactive_histogram <- ggplotly(histogram, width = 750, height = 1000)

# Adjust margins to prevent the x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50))

# Display the interactive plot
interactive_histogram
Upon first glance, the majority of property types are valued at less than 3 million, with apartments and single houses being the most frequent.
4.3.2 Histogram of prices for each year category
Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 5000000, by = 1000000)
labels <- paste0(breaks / 1000000, "Mio")

# Create a histogram of prices for each year category
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ year_category, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Year Category", x = "Price", y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5000000))

# Convert ggplot object to plotly object
interactive_histogram_year <- ggplotly(histogram, width = 750, height = 1000)

# Adjust margins to prevent the x-axis label from being cut off
interactive_histogram_year <- layout(interactive_histogram_year, margin = list(l = 50, r = 50, b = 50, t = 50))

# Display the interactive plot
interactive_histogram_year
The majority of properties were built between 2016 and 2024. Interestingly, the distribution remains similar across almost all periods.
4.3.3 Histogram of prices for each canton
Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 5200000, by = 1000000)
labels <- paste0(breaks / 1000000, "Mio")

# Create the histogram
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ canton, scales = "free", ncol = 2) +
  labs(title = "Distribution by Canton for properties between 0 and 5 million",
       x = "Price", y = "Frequency") +
  theme(axis.text.y = element_text(size = 2)) +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5200000))

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 1000)

# Adjust margins to prevent the x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50))

# Display the interactive plot
interactive_histogram
Compared to other cantons, Geneva has a distinct distribution. The majority of the listed properties are concentrated in the cantons of Vaud, Valais, Tessin, Bern, and Fribourg.
4.3.4 Histogram of prices for each number of rooms
Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 5200000, by = 1000000)
labels <- paste0(breaks / 1000000, "Mio")

# Create the histogram
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ number_of_rooms, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Number of Rooms", x = "Price", y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5200000))

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 2000)

# Adjust margins to prevent the x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50))

# Display the interactive plot
interactive_histogram
The majority of properties have fewer than 10 rooms, and the distribution tends to shift slightly towards higher prices as the number of rooms increases.
4.4 Histogram of properties by square meters
To improve readability, we only show properties smaller than 1,000 square meters.
Click to show code
histogram <- ggplot(properties_filtered, aes(x = square_meters)) +
  geom_histogram(binwidth = 15, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Properties by Square Meters", x = "Square Meters", y = "Frequency") +
  theme_minimal() +
  xlim(0, 1000)

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 500)

# Adjust margins to prevent the x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50))

# Display the interactive plot
interactive_histogram
4.5 Boxplot of prices by Tax Cluster
Click to show code
# Create the boxplot
boxplot <- ggplot(properties_filtered, aes(x = as.factor(Tax_cluster), y = price)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Property Prices by Tax Cluster", x = "Tax Cluster", y = "Price") +
  theme_minimal() +
  scale_y_continuous(labels = scales::unit_format(unit = "Mio", scale = 1e-6), limits = c(0, 4000000))

# Convert ggplot object to plotly object
interactive_boxplot <- ggplotly(boxplot, width = 750, height = 500)

# Display the interactive plot
interactive_boxplot
4.6 Boxplot of prices by Political Cluster
Click to show code
# Create the boxplot
boxplot <- ggplot(properties_filtered, aes(x = as.factor(Political_cluster), y = price)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Property Prices by Political Cluster", x = "Political Cluster", y = "Price") +
  theme_minimal() +
  scale_y_continuous(labels = scales::unit_format(unit = "Mio", scale = 1e-6), limits = c(0, 4000000))

# Convert ggplot object to plotly object
interactive_boxplot <- ggplotly(boxplot, width = 750, height = 500)
interactive_boxplot
4.7 Boxplot of prices by Demographic Cluster
Click to show code
# Create the boxplot
boxplot <- ggplot(properties_filtered, aes(x = as.factor(Demographic_cluster), y = price)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Property Prices by Demographic Cluster", x = "Demographic Cluster", y = "Price") +
  theme_minimal() +
  scale_y_continuous(labels = scales::unit_format(unit = "Mio", scale = 1e-6), limits = c(0, 4000000))

# Convert ggplot object to plotly object
interactive_boxplot <- ggplotly(boxplot, width = 750, height = 500)
interactive_boxplot
5 Supervised learning
Data splitting (if a training/test set split is enough for the global analysis, at least one CV or bootstrap must be used)
Two or more models
Two or more scores
Tuning of one or more hyperparameters per model
Interpretation of the model(s)
5.1 Model 1
This section provides a comprehensive outline of the linear regression model and analysis methods employed in the study of property price determinants.
5.1.1 Tools and Software
The study was conducted using R, a programming language and environment widely recognized for its robust capabilities in statistical computing and graphics. Key packages used include:
corrplot for visualization of correlation matrices, aiding in preliminary feature selection.
car for diagnostic testing and variance inflation factor (VIF) analysis to detect multicollinearity among predictors.
caret for creating training and testing sets, and managing cross-validation processes.
ggplot2 and plotly for creating visual representations of model diagnostics, predictions, and residuals.
gtsummary for creating publication-ready tables summarizing regression analysis results.
Each of these tools was chosen for its specific functionality that aids in robust data analysis, ensuring that each step of the model building and evaluation process is well-supported.
Initially, a correlation analysis was conducted to identify and visualize linear relationships between the property prices and potential predictive variables such as the number of rooms and square meters. The correlation matrix was computed and plotted using the corrplot package:
The correlations with price are 0.02 for the number of rooms and 0.68 for square meters, suggesting that square meters is a strong predictor while the number of rooms, on its own, is only weakly related to price. The number of rooms and square meters are themselves only weakly correlated (0.12), indicating that they provide largely independent information for the model.
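The correlation step described above can be sketched as follows (a minimal illustration, assuming properties_filtered contains the numeric columns price, number_of_rooms, and square_meters):

```r
library(corrplot)

# Correlation matrix of the numeric predictors and the target
num_vars <- properties_filtered[, c("price", "number_of_rooms", "square_meters")]
cor_matrix <- cor(num_vars, use = "complete.obs")

# Visualize the matrix; coefficients are printed in each cell
corrplot(cor_matrix, method = "color", addCoef.col = "black", type = "upper")
```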
Two open questions remain: how to measure the association between price and the categorical variables, since Pearson correlation does not apply to them directly, and whether a VIF analysis is really necessary and meaningful given the mix of numeric and categorical predictors.
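One possible answer to the first question, not used in the report itself, is the correlation ratio (eta squared) from a one-way ANOVA, which measures the share of price variance explained by a categorical variable. A hedged sketch, assuming the factor columns exist in properties_filtered:

```r
# Eta squared: between-group sum of squares / total sum of squares
eta_squared <- function(df, factor_col) {
  fit <- aov(price ~ df[[factor_col]], data = df)
  ss <- summary(fit)[[1]][["Sum Sq"]]
  ss[1] / sum(ss)
}

eta_squared(properties_filtered, "property_type")
eta_squared(properties_filtered, "canton")
```

A value near 0 means the categorical variable explains almost none of the price variance; a value near 1 means it explains almost all of it.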
5.1.2 Basic Model
5.1.2.1 Model Building and Evaluation
The dataset was split into training and testing sets to validate the model's performance. The linear regression model was then fitted using the selected predictors:
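The split-and-fit step is not rendered above; a minimal sketch of what it may look like, assuming caret's createDataPartition and the predictor set shown in the regression table below (the 80/20 split ratio is an assumption):

```r
library(caret)

set.seed(123)  # reproducible split
train_idx <- createDataPartition(properties_filtered$price, p = 0.8, list = FALSE)
trainData <- properties_filtered[train_idx, ]
testData  <- properties_filtered[-train_idx, ]

# Linear model on the selected predictors
lm_model <- lm(price ~ number_of_rooms + square_meters + property_type + floor +
                 year_category + Demographic_cluster + Political_cluster + Tax_cluster,
               data = trainData)
```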
Diagnostic checks such as residual analysis and normality tests were conducted to validate model assumptions. Performance metrics including R-squared and RMSE were calculated to assess the model’s explanatory power and prediction accuracy.
Click to show code
sum(is.na(testData$price))  # Check NAs in price
#> [1] 0
sapply(testData, function(x) sum(is.na(x)))  # Check NAs in all predictors
#>            zip_code               price     number_of_rooms 
#>                   0                   0                   0 
#>       square_meters             address              canton 
#>                   0                   0                   0 
#>       property_type               floor       year_category 
#>                   0                   0                   0 
#>           Community         Canton_code                 lon 
#>                   0                   0                   0 
#>                 lat Demographic_cluster   Political_cluster 
#>                   0                   0                   0 
#>         Tax_cluster 
#>                   0
Click to show code
# Model Evaluation
# Diagnostic Checks
# plot(lm_model)
# ad.test(residuals(lm_model))

# Use gtsummary to show the result
tbl_reg_1 <- gtsummary::tbl_regression(lm_model)
tbl_reg_1
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| number_of_rooms | 25,198 | 12,769, 37,626 | <0.001 |
| square_meters | 8,494 | 8,285, 8,703 | <0.001 |
| property_type |  |  |  |
| Apartment | — | — |  |
| Attic flat | 129,020 | 47,784, 210,257 | 0.002 |
| Bifamiliar house | -188,611 | -265,804, -111,418 | <0.001 |
| Chalet | 143,905 | 46,808, 241,002 | 0.004 |
| Duplex | -96,168 | -194,952, 2,617 | 0.056 |
| Farm house | -436,410 | -660,919, -211,902 | <0.001 |
| Loft | -144,419 | -657,067, 368,228 | 0.6 |
| Roof flat | -74,019 | -188,970, 40,932 | 0.2 |
| Rustic house | -109,998 | -563,801, 343,804 | 0.6 |
| Single house | -135,237 | -180,832, -89,643 | <0.001 |
| Terrace flat | -54,410 | -218,363, 109,543 | 0.5 |
| Villa | 192,125 | 119,772, 264,479 | <0.001 |
| floor |  |  |  |
| floor | — | — |  |
| eg | 20,745 | -25,864, 67,355 | 0.4 |
| noteg |  |  |  |
| year_category |  |  |  |
| 0-1919 | — | — |  |
| 1919-1945 | 236,188 | 129,145, 343,232 | <0.001 |
| 1946-1960 | 294,589 | 193,908, 395,269 | <0.001 |
| 1961-1970 | 239,390 | 154,092, 324,687 | <0.001 |
| 1971-1980 | 315,114 | 239,077, 391,150 | <0.001 |
| 1981-1990 | 297,688 | 220,824, 374,552 | <0.001 |
| 1991-2000 | 355,533 | 275,643, 435,424 | <0.001 |
| 2001-2005 | 498,869 | 402,410, 595,329 | <0.001 |
| 2006-2010 | 560,413 | 475,481, 645,346 | <0.001 |
| 2011-2015 | 592,744 | 510,202, 675,286 | <0.001 |
| 2016-2024 | 453,204 | 388,244, 518,164 | <0.001 |
| Demographic_cluster | 73,605 | 60,028, 87,182 | <0.001 |
| Political_cluster | -52,929 | -63,080, -42,778 | <0.001 |
| Tax_cluster | -73,874 | -88,207, -59,542 | <0.001 |

1 CI = Confidence Interval
5.1.2.1.1 Assess Overfitting
Click to show code
# For the Linear Model
lm_train_pred <- predict(lm_model, newdata = trainData)
lm_test_pred <- predict(lm_model, newdata = testData)

# Calculate RMSE and R-squared for Training Data
lm_train_rmse <- sqrt(mean((trainData$price - lm_train_pred)^2))
lm_train_rsquared <- summary(lm(lm_train_pred ~ trainData$price))$r.squared

# Calculate RMSE and R-squared for Test Data
lm_test_rmse <- sqrt(mean((testData$price - lm_test_pred)^2))
lm_test_rsquared <- summary(lm(lm_test_pred ~ testData$price))$r.squared

# Show the results in a table
results_table <- data.frame(
  Model = c("Linear Regression"),
  RMSE_Train = lm_train_rmse,
  RMSE_Test = lm_test_rmse,
  Rsquared_Train = lm_train_rsquared,
  Rsquared_Test = lm_test_rsquared
)
results_table
#>               Model RMSE_Train RMSE_Test Rsquared_Train Rsquared_Test
#> 1 Linear Regression     975651    964081          0.476         0.499
5.1.2.1.2 Metrics
Here are the performance metrics for the initial model:

Click to show code

# Print R-squared, Adjusted R-squared, RMSE, MAE and AIC
r_sq <- summary(lm_model)$r.squared
adj_r_sq <- summary(lm_model)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model, newdata = testData))
mae <- mae(testData$price, predict(lm_model, newdata = testData))
aic <- AIC(lm_model)

# Show those metrics in an HTML table
metrics_1 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html",
    col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Basic Model Performance Metrics" = 5))
metrics_1

Basic Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.476 | 0.475 | 964081 | 488516 | 501282 |
5.1.2.2 Hyperparameter Tuning
5.1.2.2.1 Stepwise Regression
Stepwise regression was performed to refine the model and improve its predictive performance. The resulting model (lm_model2) was evaluated with the same diagnostic checks and performance metrics as the initial model:
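The stepwise selection itself is not rendered above; it can be reproduced along these lines (a sketch assuming MASS::stepAIC and the full lm_model as starting point):

```r
library(MASS)

# Bidirectional stepwise selection by AIC, starting from the full model
lm_model2 <- stepAIC(lm_model, direction = "both", trace = FALSE)
summary(lm_model2)
```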
A Lasso and Ridge regression were also performed to compare the performance of the linear regression model with regularization techniques. We fit both models using cross-validation to determine the optimal lambda (penalty parameter). The plots show the lambda selection process for both Lasso and Ridge models. ::: {.cell layout-align=“center”}
Click to show code
# Convert data frames to matrices for glmnet
dat_tr_re_mat_x <- as.matrix(trainData[, c("number_of_rooms", "square_meters", "floor", "year_category", "Demographic_cluster", "Political_cluster", "Tax_cluster")])
dat_tr_re_mat_y <- trainData$price
dat_te_re_mat_x <- as.matrix(testData[, c("number_of_rooms", "square_meters", "floor", "year_category", "Demographic_cluster", "Political_cluster", "Tax_cluster")])
dat_te_re_mat_y <- testData$price

# Fit Lasso and Ridge
set.seed(123) # For reproducibility

# Fit Lasso model
lasso_model <- cv.glmnet(dat_tr_re_mat_x, dat_tr_re_mat_y, alpha = 1) # Lasso
#> Warning in storage.mode(xd) <- "double": NAs introduced by coercion
#> (this warning is repeated for every fold: as.matrix() on a data frame with
#> factor columns produces a character matrix; data.matrix() or model.matrix()
#> would avoid the coercion)

# Fit Ridge model
ridge_model <- cv.glmnet(dat_tr_re_mat_x, dat_tr_re_mat_y, alpha = 0) # Ridge

# Refit each model at its cross-validated best lambda
ridge_fit_best <- glmnet(x = dat_tr_re_mat_x, y = dat_tr_re_mat_y, lambda = ridge_model$lambda.min)
lasso_fit_best <- glmnet(x = dat_tr_re_mat_x, y = dat_tr_re_mat_y, lambda = lasso_model$lambda.min) # can also use lasso_model$lambda.1se

# Lasso & Ridge performance on the training set
postResample(predict(ridge_fit_best, newx = dat_tr_re_mat_x), dat_tr_re_mat_y)
#>     RMSE Rsquared      MAE
#> 1.00e+06 4.50e-01 5.09e+05
postResample(predict(lasso_fit_best, newx = dat_tr_re_mat_x), dat_tr_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.93e+05 4.57e-01 5.00e+05

# Lasso & Ridge performance on the test set
postResample(predict(ridge_fit_best, newx = dat_te_re_mat_x), dat_te_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.96e+05 4.76e-01 5.00e+05
postResample(predict(lasso_fit_best, newx = dat_te_re_mat_x), dat_te_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.82e+05 4.82e-01 4.90e+05

# Stepwise lm performance on the training and test sets
postResample(predict(lm_model2, trainData), dat_tr_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.76e+05 4.76e-01 4.97e+05
postResample(predict(lm_model2, testData), dat_te_re_mat_y)
#>     RMSE Rsquared      MAE
#> 9.64e+05 4.99e-01 4.89e+05
:::

We then compared the performance of the Lasso and Ridge models with the stepwise linear regression model using RMSE and MAE:
Click to show code
# Helper to compute RMSE, MAE and R-squared
get_metrics <- function(predictions, actuals) {
  RMSE <- sqrt(mean((predictions - actuals)^2))
  MAE <- mean(abs(predictions - actuals))
  Rsquared <- cor(predictions, actuals)^2
  return(c(RMSE = RMSE, MAE = MAE, Rsquared = Rsquared))
}

# Capture the predictions
ridge_train_preds <- predict(ridge_fit_best, newx = dat_tr_re_mat_x)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#> (the same coercion warning is emitted by each glmnet prediction below)
lasso_train_preds <- predict(lasso_fit_best, newx = dat_tr_re_mat_x)
ridge_test_preds <- predict(ridge_fit_best, newx = dat_te_re_mat_x)
lasso_test_preds <- predict(lasso_fit_best, newx = dat_te_re_mat_x)
lm_train_preds <- predict(lm_model2, trainData)
lm_test_preds <- predict(lm_model2, testData)

# Compute the performance metrics
ridge_train_metrics <- get_metrics(ridge_train_preds, dat_tr_re_mat_y)
lasso_train_metrics <- get_metrics(lasso_train_preds, dat_tr_re_mat_y)
ridge_test_metrics <- get_metrics(ridge_test_preds, dat_te_re_mat_y)
lasso_test_metrics <- get_metrics(lasso_test_preds, dat_te_re_mat_y)
lm_train_metrics <- get_metrics(lm_train_preds, dat_tr_re_mat_y)
lm_test_metrics <- get_metrics(lm_test_preds, dat_te_re_mat_y)

# Create a data frame with the performance metrics
performance_df <- data.frame(
  Model = c("Ridge (Training)", "Lasso (Training)", "Ridge (Test)",
            "Lasso (Test)", "Stepwise (Training)", "Stepwise (Test)"),
  RMSE = c(ridge_train_metrics["RMSE"], lasso_train_metrics["RMSE"],
           ridge_test_metrics["RMSE"], lasso_test_metrics["RMSE"],
           lm_train_metrics["RMSE"], lm_test_metrics["RMSE"]),
  MAE = c(ridge_train_metrics["MAE"], lasso_train_metrics["MAE"],
          ridge_test_metrics["MAE"], lasso_test_metrics["MAE"],
          lm_train_metrics["MAE"], lm_test_metrics["MAE"]),
  Rsquared = c(ridge_train_metrics["Rsquared"], lasso_train_metrics["Rsquared"],
               ridge_test_metrics["Rsquared"], lasso_test_metrics["Rsquared"],
               lm_train_metrics["Rsquared"], lm_test_metrics["Rsquared"])
)

# Build the kableExtra table
performance_table <- kable(performance_df, format = "html") %>%
  kable_styling(full_width = FALSE, position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Performance Metrics (RMSE, MAE, R-sq)" = 4))

# Print the table
performance_table
**Performance Metrics (RMSE, MAE, R-sq)**

| Model | RMSE | MAE | Rsquared |
|---|---|---|---|
| Ridge (Training) | 1003013 | 508981 | 0.450 |
| Lasso (Training) | 993240 | 500232 | 0.457 |
| Ridge (Test) | 995990 | 500158 | 0.476 |
| Lasso (Test) | 981526 | 490425 | 0.482 |
| Stepwise (Training) | 975674 | 496834 | 0.476 |
| Stepwise (Test) | 964156 | 488684 | 0.499 |
The stepwise model is thus the best of the three, as it has the lowest RMSE and MAE on both the training and test sets. Among the regularized models, Lasso outperforms Ridge on RMSE, MAE, and R-squared, indicating that Lasso's variable selection is more effective than Ridge's uniform shrinkage in this context.
5.1.2.2.3 Metrics
Here we compare the performance metrics of the initial model and the stepwise model. The metrics of our initial model: ::: {.cell layout-align=“center”}
Click to show code
metrics_1
**Basic Model Performance Metrics**

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.476 | 0.475 | 964081 | 488516 | 501282 |
:::
Stepwise model: ::: {.cell layout-align=“center”}
Click to show code
# Print R-squared, adjusted R-squared, RMSE, MAE and AIC
r_sq <- summary(lm_model2)$r.squared
adj_r_sq <- summary(lm_model2)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model2, newdata = testData))
mae <- mae(testData$price, predict(lm_model2, newdata = testData))
aic <- AIC(lm_model2)

# Show those metrics in an HTML table
metrics_2 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html",
                   col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Stepwise Model Performance Metrics" = 5))
metrics_2
**Stepwise Model Performance Metrics**

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.476 | 0.475 | 964156 | 488684 | 501280 |
:::
5.1.2.3 Cross-Validation
Cross-validation was used to assess the model's robustness, helping to avoid overfitting and to ensure that the model generalizes well to new data; the caret package manages this process efficiently. The CV results report metrics such as RMSE and R-squared across folds, which indicate how stable the model's performance is across different subsets of the data.
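As a minimal sketch of how such a 10-fold cross-validation can be configured with caret (the formula below is a stand-in, not the exact set of predictors retained by the stepwise selection):

```r
library(caret)

set.seed(123)  # for reproducibility

# 10-fold cross-validation setup
ctrl <- trainControl(method = "cv", number = 10)

# Re-estimate a linear model under CV; the predictors here are illustrative
cv_fit <- train(price ~ number_of_rooms + square_meters + floor,
                data = trainData, method = "lm", trControl = ctrl)

# Aggregated RMSE / R-squared / MAE with their fold-to-fold standard deviations
cv_fit$results
```

The `$results` data frame is what a table like the 10-fold CV metrics shown later in this report is built from.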
Here are the performance metrics for the stepwise model: ::: {.cell layout-align=“center”}
Click to show code
metrics_2
**Stepwise Model Performance Metrics**

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.476 | 0.475 | 964156 | 488684 | 501280 |
:::
The 10-fold cross-validation metrics (RMSE 983,467; R-squared 0.475; MAE 506,256) are close to the stepwise model's test-set results (RMSE 964,156; R-squared 0.476; MAE 488,684), suggesting that the model's performance is consistent across resampled subsets of the data and therefore reasonably robust.
5.1.2.4 Model testing
We chose the stepwise model (lm_model2), as it achieved the lowest RMSE and MAE on the test set among the linear, Lasso, and Ridge candidates, together with a slightly lower AIC than the basic model.
The final model was tested using the unseen test dataset to evaluate its predictive accuracy. Residual plots and actual vs. predicted price plots were created to visually assess model performance:
5.1.2.4.1 Residual vs Predicted Prices
This plot shows residuals (differences between observed and predicted prices) against predicted prices. Ideally, residuals should randomly scatter around the horizontal line at zero, indicating that the model doesn’t systematically overestimate or underestimate prices.
Click to show code
# Model Testing
## Test the model
predicted_prices <- predict(lm_model2, newdata = testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment

# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly and show the interactive plot
p <- ggplotly(gg, width = 600, height = 400)
p
As we can observe, the spread of residuals suggests potential issues with model fit, particularly for higher price predictions where the variance seems to increase.
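One way to quantify this visual impression, not part of the original analysis, is a Breusch-Pagan test for heteroscedasticity. A minimal sketch, assuming the lmtest package is available and using the lm_model2 fit from above:

```r
library(lmtest)  # assumed available; not used elsewhere in this report

# Breusch-Pagan test on the fitted model:
# a small p-value indicates non-constant residual variance
bptest(lm_model2)
```

A significant result would support the percentile-filtering approach taken in the next section.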
5.1.2.4.2 Actual vs Predicted Prices
This plot should ideally show points along the diagonal line, where actual prices equal predicted prices. The clustering of points along the line suggests a generally good fit, but the spread indicates growing variance in the predictions, especially at higher price points. ::: {.cell layout-align=“center”}
Click to show code
## Visualization
gg <- ggplot(data = testData, aes(x = predicted_prices, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Actual vs Predicted Prices", x = "Predicted Prices", y = "Actual Prices")

# Convert ggplot to plotly and show the interactive plot
p <- ggplotly(gg, width = 600, height = 400)
p
:::
5.1.3 Linear Regression between 10th and 90th percentile
To address this variance at higher price points, we can restrict the data to a narrower price range. Here, we filter the data to prices between the 10th and 90th percentiles of the price distribution to see whether the model performs better within this range:
Click to show code
# Filter properties_filtered to prices between the 10th and 90th percentiles
properties_filtered <- properties_filtered %>%
  filter(price >= quantile(price, 0.1) & price <= quantile(price, 0.9))
5.1.3.1 Model Building and Evaluation
We then repeat the model building and evaluation process for this filtered dataset to compare the performance with the initial (best) model: ::: {.cell layout-align=“center”}
Here are the performance metrics for the model with prices between the 10th and 90th percentiles: ::: {.cell layout-align=“center”}
Click to show code
# Print R-squared, adjusted R-squared, RMSE, MAE and AIC
r_sq <- summary(lm_model_10_90)$r.squared
adj_r_sq <- summary(lm_model_10_90)$adj.r.squared
# Note: the original call omitted newdata = testData, so predict() returned
# fitted training values and triggered a length-mismatch warning:
#> Warning in actual - predicted: longer object length is not a
#> multiple of shorter object length
rmse <- rmse(testData$price, predict(lm_model_10_90, newdata = testData))
mae <- mae(testData$price, predict(lm_model_10_90, newdata = testData))
aic <- AIC(lm_model_10_90)

# Show those metrics in an HTML table
metrics_1_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html",
                         col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Basic Model Performance Metrics (10-90 Qt)" = 5))
metrics_1_10_90
**Basic Model Performance Metrics (10-90 Qt)**

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.33 | 0.329 | 531600 | 291884 | 378076 |
:::
Here were the previous metrics of our first basic model (without the 10-90 Qt filter). ::: {.cell layout-align=“center”}
Here are the performance metrics for the stepwise model with prices between the 10th and 90th percentiles as well as the comparison with the initial model: ::: {.cell layout-align=“center”}
Click to show code
## Performance metrics
r_sq <- summary(lm_model2_10_90)$r.squared
adj_r_sq <- summary(lm_model2_10_90)$adj.r.squared
# Note: as above, the original call omitted newdata = testData, causing the
# length-mismatch warning; the corrected call is used here
rmse <- rmse(testData$price, predict(lm_model2_10_90, newdata = testData))
mae <- mae(testData$price, predict(lm_model2_10_90, newdata = testData))
aic <- AIC(lm_model2_10_90)

# Show those metrics in an HTML table
metrics_2_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html",
                         col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Stepwise Model Performance Metrics (10-90 Qt)" = 5))
metrics_2_10_90
**Stepwise Model Performance Metrics (10-90 Qt)**

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.33 | 0.329 | 531600 | 291884 | 378076 |
:::
Here were the previous metrics of our basic model (without stepwise selection): ::: {.cell layout-align=“center”}
Click to show code
metrics_1_10_90
**Basic Model Performance Metrics (10-90 Qt)**

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.33 | 0.329 | 531600 | 291884 | 378076 |
:::
Here were the previous 10-fold cross-validation metrics (without the 10-90 Qt filter): ::: {.cell layout-align=“center”}
Click to show code
metrics_cv_1
**10 Fold Cross Validation Metrics**

| intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| TRUE | 983467 | 0.475 | 506256 | 127699 | 0.062 | 25120 |
:::
5.1.3.4 Model testing
5.1.3.4.1 Residual vs Predicted Prices
Click to show code
# Model Testing
## Test the model
predicted_prices <- predict(lm_model2_10_90, newdata = testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment

# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly and show the interactive plot
p <- ggplotly(gg, width = 600, height = 400)
p
5.1.3.4.2 Actual vs Predicted Prices
Click to show code
## Visualization
gg <- ggplot(data = testData, aes(x = predicted_prices, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Actual vs Predicted Prices", x = "Predicted Prices", y = "Actual Prices")

# Convert ggplot to plotly and show the interactive plot
p <- ggplotly(gg, width = 600, height = 400)
p
5.2 Model 2
5.2.1 Random Forest
We decided to implement a Random Forest model to compare its performance with the linear regression model. Random Forest is an ensemble learning method that builds many decision trees during training; for regression it outputs the mean prediction of the individual trees (for classification, the majority vote). The method is known for its robustness and its ability to capture complex, non-linear relationships in the data.
5.2.1.1 Model Building and Evaluation
We split the data into training and testing sets, fit the Random Forest model and then evaluated the model using performance metrics such as R-squared, RMSE, and MAE to assess its predictive accuracy and explanatory power. ::: {.cell layout-align=“center”}
Click to show code
# Split the data into training and test sets (0.8 / 0.2)
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p = 0.8, list = FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

# Apply the RF model as a regression
rf_model <- randomForest(price ~ ., data = trainData, ntree = 500, importance = TRUE)
rf_model.pred_rf <- predict(rf_model, newdata = testData)

rmse_rf <- sqrt(mean((testData$price - rf_model.pred_rf)^2))
mae_rf <- mean(abs(testData$price - rf_model.pred_rf))
r_sq_rf <- cor(testData$price, rf_model.pred_rf)^2

# Show those metrics in an HTML table
metrics_rf <- kable(data.frame(r_sq_rf, rmse_rf, mae_rf), format = "html",
                    col.names = c("R-Squared", "RMSE", "MAE")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Random Forest Model Performance Metrics" = 3))
metrics_rf
**Random Forest Model Performance Metrics**

| R-Squared | RMSE | MAE |
|---|---|---|
| 0.683 | 264181 | 195532 |
:::

Comparing with the best linear model (the stepwise regression), the Random Forest has a substantially higher R-squared (0.683 vs. 0.476) and much lower RMSE and MAE, indicating better predictive accuracy. ::: {.cell layout-align=“center”}
Click to show code
metrics_2
**Stepwise Model Performance Metrics**

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.476 | 0.475 | 964156 | 488684 | 501280 |
:::
The plot shows the actual vs. predicted prices, with the diagonal line indicating perfect predictions. ::: {.cell layout-align=“center”}
Click to show code
# Actual price (y-axis) against predicted price (x-axis);
# the axis labels were swapped in the original and are corrected here
plot(testData$price ~ rf_model.pred_rf, col = 'skyblue',
     xlab = 'Predicted Price', ylab = 'Actual Price',
     main = 'Actual vs Predicted Price')
abline(0, 1, col = 'darkred')
:::
5.2.1.2 Variable Importance
VI plots are a useful tool to understand the relative importance of predictors in the Random Forest model. This plot shows the importance of each predictor in the model, helping to identify key features that drive price predictions. ::: {.cell layout-align=“center”}
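Since rf_model was fitted with importance = TRUE above, the plot can be drawn directly from the fitted forest; a minimal sketch:

```r
library(randomForest)  # assumed loaded earlier in the report

# %IncMSE and IncNodePurity per predictor; larger values = more important
varImpPlot(rf_model, main = "Variable Importance")

# The underlying importance scores as a matrix, for inspection
importance(rf_model)
```

`%IncMSE` measures the increase in out-of-bag error when a predictor is permuted, which is generally the more reliable of the two importance measures.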